This declaration is submitted under 37 C.F.R. § 1.131 in response to the rejection 
of U.S. Patent Application Serial Number 10/643,574 (the "Application"), assigned to 
Cray Inc., to establish the inapplicability of the reference "Cray Assembly 
Language (CAL) for Cray X1™ Systems Reference Manual," published June 2003 (the 
"Cray Manual"), as a basis for rejecting the claims of the instant application under 
35 U.S.C. § 103(a). U.S. Patent Application Serial No. 10/643,574 was filed August 18, 2003. 

I, James R. Kohn, do hereby declare: 

1. I am currently an employee of Cray Inc., the assignee of the Application and 
publisher of the Cray Manual, and have been an employee since at least as early 
as December 9, 2002. 

2. I am the inventor of the claims of the Application. 

3. The subject matter claimed in the Application was invented prior to June, 2003, 
the publication date of the Cray Manual. 

4. The enclosed document shows a copy of an original e-mail dated December 9, 
2002 that I (Jim Kohn) authored. The e-mail includes code that I created which 
implements the invention of the Application. 

5. The subject matter claimed in the Application was invented in the United States. 

statements and the like are punishable by fine or imprisonment, or both, under 
Section 1001 of Title 18 of the United States Code, and that such willful false 
statements may jeopardize the validity of this application or any patent issuing 

Original Message 

Subject: Re: Vector update vfunction 
Date: Mon, 09 Dec 2002 13:49:36 -0600 
From: Jim Kohn <jkohn@cray.com> 
To: Terry Greyzck <tdg@cray.com>, Vince Graziano <vjg@cray.com>, Charlie Carroll 
<charliec@cray.com>, Lyle Williams <lew@cray.com> 
References: <200211071804.gA7I4gSH11272221@mint.us.cray.com> <3DEE5E8B.FF08C906@cray.com> 

All, 

We need to start looking at what is required to get the 
new vector update algorithm into the compiler. This 
algorithm is multistreamable and on the HMG Tabletoy 
benchmark can achieve a 4x speedup over the current 
non-streamable algorithm. We are close to the 
customer's expectation. We would need to be able 
to demonstrate this in a compiler by June 2003. 

As a start, I've attached the inline version of the 
"indexed partial reduction" code that I used to achieve 
this speedup. I also show the bracketing code where 
the ordered msync masks are created and used to order 
the final update of memory. 

A possibility to consider is the inlining of vfunction 
code with a "suppress" of the scratch A, S, and V regs 
so that these are available to the vfunction code. The 
inlining of EXP has already been requested. The "indexed 
partial reduction" is a second instance. There may be 
later vfunction algorithms that we may want to consider 
inlining as well, such as pack, expand, search, etc., 
especially multistreamed versions of these. 

I will be out of the office for the rest of today and 
tomorrow attending a funeral. We can discuss this 
further on Wednesday morning or Thursday. 

Jim 



* HMG Tabletoy update:   table[xdata.index[i]] += xdata.value[i]

* Registers computed or loaded during RHS processing of update...

        v2        [a27,2],m0          ;IX = xdata.index[*]
        v0        cidx(a11,m0)        ;IOTA
        m1        m0|m0               ;input mask
        v1        [a28,2],m0          ;Y = xdata.value[*]

* Generate ordered msync wait, send masks
* A10 = Remaining tripcount (after this pass)
* A11 = 1
* A22 = SSP#
* A26 = SSP's array offset

        a24       a22^3               ;=0 iff P3
        a25       a0<a26              ;=0 iff P0 and 1st iter, else 1
        a24       a10|a24             ;=0 iff P3 and last iteration
        a21       a22-1
        a26       a0<a24              ;=0 iff P3 and no more iters, else 1
        a23       a22+1
        a21       a21&3               ;restrict shift counts to be 0..3
        a23       a23&3
        a22       a11<<a22            ;self-mask
        a21       a25<<a21            ;mask for SSP to wait on
        a23       a26<<a23            ;mask for SSP to send
        a21       a21|a22             ;wait mask
        a22       a22|a23             ;send mask
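
The mask arithmetic above can be restated in a higher-level form. The following Python sketch is illustrative only and is not part of the original e-mail. It assumes that register a0 reads as zero, that an operand such as a0<a26 yields 1 when 0 < a26 and 0 otherwise, and that there are four SSPs numbered 0 through 3. The function and variable names are hypothetical.

```python
def ordered_msync_masks(ssp, remaining_trips, array_offset):
    """Return (wait_mask, send_mask) for one SSP of a 4-SSP ordered msync.

    Illustrative restatement of the listing; register names are in comments.
    Assumes a0 is zero and "a0<x" evaluates to 1 when 0 < x, else 0.
    """
    not_p3 = ssp ^ 3                                  # a24: 0 iff this SSP is P3
    has_predecessor = 1 if array_offset > 0 else 0    # a25: 0 iff P0 on 1st iter
    more_work = remaining_trips | not_p3              # a24: 0 iff P3 and last iter
    has_successor = 1 if more_work > 0 else 0         # a26
    prev_ssp = (ssp - 1) & 3                          # a21: shift count in 0..3
    next_ssp = (ssp + 1) & 3                          # a23: shift count in 0..3
    self_mask = 1 << ssp                              # a22
    wait_mask = (has_predecessor << prev_ssp) | self_mask
    send_mask = self_mask | (has_successor << next_ssp)
    return wait_mask, send_mask


if __name__ == "__main__":
    for ssp in range(4):
        print(ssp, ordered_msync_masks(ssp, remaining_trips=0, array_offset=64 * ssp))
```

Under these assumptions, P0 on its first pass waits only on itself, and P3 on its last pass signals no successor, which matches the "=0 iff" comments in the listing.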



Inlined "indexed partial reduction" algorithm:

        reduce(Y,IX),M1

Y' will contain Y or sum reduced values of Y for duplicate IX values;
M1 will contain an update mask where IX values are unique and also where
the Y' elements are that need to be added into the update (LHS) vector.

        v0 = IOTA vector (0,1,2,...,63)
        v1 = Y vector
        v2 = IX vector
        m1 = Input mask
        VL = #elements in v0, v1, v2

Output

        v1 = Y' vector
        v2 = IX vector
        m1 = Output mask of unique IX values
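
As a reading aid, the input/output contract described above can be modeled in a few lines of Python. This sketch is not taken from the e-mail: it is a scalar model of the stated behavior, not the vector algorithm itself. It assumes that the last occurrence of a duplicated index is the element that keeps the sum; the description above does not say which element holds it. All names are hypothetical.

```python
def indexed_partial_reduction(y, ix, mask):
    """Scalar model of reduce(Y,IX),M1.

    Returns (y_out, mask_out): y_out[i] holds y[i], or the sum of all y
    values sharing ix[i] when element i owns that index; mask_out marks
    exactly one owner per distinct index value.
    Assumption: the LAST masked occurrence of an index is its owner.
    """
    y_out = list(y)
    mask_out = [False] * len(y)
    owner = {}  # index value -> element number that accumulates its sum
    for i in reversed(range(len(y))):
        if not mask[i]:
            continue
        if ix[i] in owner:
            y_out[owner[ix[i]]] += y[i]   # fold duplicate into the owner
        else:
            owner[ix[i]] = i              # first seen from the end: the owner
            mask_out[i] = True
    return y_out, mask_out


if __name__ == "__main__":
    print(indexed_partial_reduction([1, 2, 3, 4], [5, 7, 5, 9], [True] * 4))
```

With duplicates folded this way, the later table update touches each index at most once, so the gather and scatter can proceed without element-to-element dependencies.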


CNFXSZ  =         16384               ;Size of scratch conflict analysis

        s4        CNFXSZ-1
        a29       vl
        a45       CNFXSZ*8-8
        v5        v2&s4,m0            ;Conflict index set masked from ix
        m4        fill(a29)
        m3        m1&m4               ;Clear trailing mask bits beyond VL
        a20       CNFXSZ*8
        a45       a63-a45
        s28       8
        a63       a63-a20             ;Allocate private stack space
        v6        v2<<s28,m0          ;(ix<<8) to make room for IOTA
        v4        v6|v0,m0            ;(ix<<8)|IOTA
        a27       last(m4)            ;last valid element#
cnfxloop =                            ;"False positive" conflict loop
        [a45,v5]  v4,m3,ord           ;Scatter (ix<<8)|IOTA (to scratch
        s27       x'00ff:d
        lsync     v,v
        v6        [a45,v5],m3         ;Gather (ix<<8)'|IOTA'
        v7        +v6>>s28,m3         ;Extract ix'
        m2        v7==v2,m3           ;M2 excludes ix's mapping to same CNFX
        v9        v6&s27,m3           ;Element #s of y sums
        m4        v9!=v0,m2           ;Conflict map
        m3        ~m2&m3              ;Map of remaining ix values
        a6        1
        a29       pop(m4)             ;Conflict trip count (tc)
        v7        cmprss(v9,m4)       ;IOTA's that conflicts map to
        a26       pop(m3)             ;>0 if ix's mapped to same CNFX
        m1        ~m4&m1              ;Exclude conflicts in final M1
        a1        v7,0                ;1st iota into which to sum (iota1)
        a8        a6<a29              ;=1 if tc > 1



        v7,a29    a27                 ;Store safe y sum index at end
        a6        a0<a29              ;=1 if tc > 0
        a7        a6+a8               ;=2 if tc > 1, else tc
                                      ;2nd iota into which to sum (iota2)
                                      ;3rd iota into which to sum (iota3)
                  cmprss(v1,m4)       ;y values to add into y sums
                  a29,noconflict      ;If no conflicts exist
        a11       v8,0                ;Get 1st 3 y values (y1,y2,y3)
        v8,a29    s0                  ;Store 0 for conflict summing at end
        a12       v8,a6
        s3        v8,a7








        $REPEAT                       ;Repeat 3 update fixes per iteration
        a5        a7<a29              ;=1 if >=0 more conflicts (another iter)
        s5        v1,a1               ;Get 3 y sums (to sum conflicts into)
        a23       a2^a1               ;Determine conflict: iota2==iota1
        a5        a7+a5
        s6        v1,a2
        a24       a3^a1               ;Determine conflict: iota3==iota1
        a15       a5<a29              ;=1 if >=1 more conflicts
        s7        v1,a3
        a25                           ;Determine conflict: iota3==iota2
        a6        a5+a15
        a16       a1                  ;Save iota1
        a1        v7,a5               ;Bottom load next iter's iota1
        a7        a6<a29              ;=1 if >=2 more conflicts
        a17       a2                  ;Save iota2
        a2        v7,a6               ;Bottom load next iter's iota2
        a7        a6+a7
        a18       a3                  ;Save iota3
        a13       a11
        s1        a11
        a11       a24?a0:a11          ;y1 if iota3==iota1, else 0
        a3        v7,a7               ;Bottom load next iter's iota3
        a13       a23?a0:a13          ;y1 if iota2==iota1, else 0
        s2        a12
        a12       a25?a0:a12          ;y2 if iota3==iota2, else 0
        s11       a11
        a11       v8,a5               ;Bottom load next iter's y1
        s13       a13
        s12       a12
        a12       v8,a6               ;Bottom load next iter's y2
        s4,d      s3+s11              ;y3 += (iota3==iota1) ? y1 : 0
        s3        v8,a7               ;Bottom load next iter's y3
        s2,d      s2+s13              ;y2 += (iota2==iota1) ? y1 : 0
        s4,d      s4+s12              ;y3 += (iota3==iota2) ? y2 : 0
        s5,d      s5+s1               ;Sum1 += y1
        s6,d      s6+s2               ;Sum2 += y2 [+ y1]
        s7,d      s7+s4               ;Sum3 += y3 [+ y1] [+ y2]
        v1,a16
        v1,a17
        v1,a18



noconflict                            ;Branch here if no conflicts
                  a26,cnfxloop        ;Repeat if more ix's mapped to same CNFX

        a63       a63+a20             ;Restore stack frame

End of inlined "indexed partial reduction" algorithm. 
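
The scatter/gather conflict test at the top of cnfxloop can be paraphrased as follows. This Python sketch is a hedged reading of the listing rather than a transcription of it. It assumes that the ordered scatter lets the highest-numbered element win a contested scratch slot, and it uses a dictionary in place of the private stack scratch area. The function name and return values are hypothetical. The 8-bit element field mirrors the ix<<8 packing in the listing, where vectors hold 64 elements.

```python
CNFXSZ = 16384  # scratch table size, as in the listing


def detect_conflicts(ix, active):
    """One pass of the scratch-table conflict test.

    Returns three lists, one entry per element:
      resolved[i]  - True if element i read back its own index value (m2)
      target[i]    - element number to sum y[i] into, or None (m4 conflicts)
      remaining[i] - True if a different index hashed to the same slot, so
                     element i must be retried in another pass (new m3)
    Assumption: in the ordered scatter, later elements overwrite earlier ones.
    """
    n = len(ix)
    scratch = {}  # stands in for the stack scratch area
    for i in range(n):
        if active[i]:
            scratch[ix[i] & (CNFXSZ - 1)] = (ix[i] << 8) | i   # (ix<<8)|IOTA
    resolved = [False] * n
    target = [None] * n
    remaining = [False] * n
    for i in range(n):
        if not active[i]:
            continue
        key = scratch[ix[i] & (CNFXSZ - 1)]
        if key >> 8 == ix[i]:            # gathered ix' equals own ix
            resolved[i] = True
            winner = key & 0xFF          # gathered IOTA'
            if winner != i:
                target[i] = winner       # duplicate index: sum into winner
        else:
            remaining[i] = True          # "false positive": retry next pass
    return resolved, target, remaining
```

Running the function again with `active` set to the returned `remaining` list corresponds to the branch back to cnfxloop while a26 is nonzero.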



Update LHS using unique IX mask, M1, and non-allocating gather/scatter
Use ordered (ripple) msyncs if multistreamed.

        msync     a21,v               ;Ordered msync
        v4        [a32,v2],m1,na      ;Gather TABLE[xdata.index[*]]
        v5,d      v4+v1,m1
        [a32,v2]  v5,m1,ord,na        ;Scatter my updated TABLE values
        msync     a22,v               ;End ordered msync
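
For orientation, the final masked update can be modeled as below. This is an illustrative single-stream Python sketch and not part of the original e-mail. It omits the msync ordering between SSPs and assumes that the mask selects at most one element per distinct index, as the output mask M1 is described to do. The function name is hypothetical.

```python
def apply_update(table, ix, y_sum, mask):
    """Masked gather, add, scatter: table[ix[i]] += y_sum[i] where mask[i].

    Assumes masked index values are unique, so gathering every element
    before scattering any of them cannot lose an update.
    """
    gathered = [table[i] if m else 0 for i, m in zip(ix, mask)]   # v4
    updated = [g + y for g, y in zip(gathered, y_sum)]            # v5
    for i, u, m in zip(ix, updated, mask):                        # scatter
        if m:
            table[i] = u
    return table


if __name__ == "__main__":
    print(apply_update([0] * 10, [5, 7, 5, 9], [1, 2, 4, 4],
                       [False, True, True, True]))
```

The gather-then-scatter split is only safe because duplicates were already folded into a single element per index; without that step, two elements sharing an index would each gather the old value and one update would be lost.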






